From: Damon Horowitz [mailto:damon@media.mit.edu]
Sent: Thursday, July 13, 2000 10:48 PM
To: Gregory Chase
Subject: Re: Question regarding taxonomies

Greg -

I would be happy to try to help you with your question, but I do not think that there is a simple answer. First of all, the notion of an "a priori taxonomy of knowledge" is a bit confused... no such thing exists, according to the relativist view, and if it did the "taxonomy" part would probably be a posteriori... as well as being as boundless and ill-defined as "knowledge".

But on a more practical side: Most AI and natural language technologies do not concern themselves with such philosophical conundrums, but instead restrict their attention to limited domains of "expert knowledge", restricted vocabularies, and common linguistic conventions, simply because only with these types of restrictions can one hope to find enough regularities to build a usefully functioning model, using statistics, rules, or black magic.

To understand how vast unlimited knowledge would be, consider, say, the Yahoo classification scheme. First you have their arbitrary breakdown into a few dozen categories, then these repeated by region, then *these* perhaps separated into commercial and non-commercial endeavors, etc. Any combination of terms or concepts creates another level of unique concepts, with their own keywords, etc. Further, all of the keywords that you want to search on are themselves of course types of concepts that need to be categorized in the tree. Lastly, there is a huge problem of polysemy in language (one word carrying several unrelated meanings) that makes this type of general purpose keyword matching nearly useless.

At the last company I worked at, we did extensive work in this area. We tried using everything from the 65K Yahoo categories, to WordNet, to commercial taxonomies such as erli/lexiquest, to more specific domain vocabularies such as UMLS, etc.

So, sorry for going on so long, but the short answer is: I don't know of something that satisfies your requirements, and I believe that this is because the requirements are ill-formed.

One benefit of a machine learning approach, which you refer to, is that it can learn the sub-domain relevant to a particular client, or to a market segment that you want to penetrate. That is much more likely to work, at least a little bit, than the "general" solution.

However, given what I know of your product offering from discussions with Max, I tend to think that you may be better off with a much simpler approach, namely, simply creating a small vocabulary of terms that you care about (again, for a specific target application) and doing keyword matching against these. If you are uncertain of what terms are important, you could collect statistics on the occurrence of *all* terms in example content you want to analyze, and then manually review these lists (perhaps filtering out common words, etc.).

Good luck with your project, let me know if you have further questions, or perhaps if you want a more involved consultation we could set something up with Max... time, of course, being the main difficulty...

- Damon
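
The simpler approach suggested in the message (count all terms in example content, review the list by hand, then match against a small vocabulary) might look roughly like the Python sketch below. It is an illustration only, not code from the original message: the stop-word list, the sample sentences, the vocabulary, and the function names (tokenize, term_counts, match_keywords) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_counts(documents, stop_words=STOP_WORDS):
    """Count every term across the example documents, skipping common words."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stop_words)
    return counts

def match_keywords(text, vocabulary):
    """Return the vocabulary terms that occur in the text, in sorted order."""
    return sorted(set(tokenize(text)) & set(vocabulary))

# Made-up example content, standing in for the material to be analyzed.
examples = [
    "The bank approved the loan for the new office.",
    "Loan rates at the bank rose again.",
]

# Step 1: collect term statistics and print the most frequent terms for manual review.
counts = term_counts(examples)
for term, n in counts.most_common(5):
    print(term, n)

# Step 2: match new text against a small, hand-picked vocabulary (assumed here).
vocabulary = {"loan", "bank", "mortgage"}
print(match_keywords("Is the bank offering a mortgage?", vocabulary))
```

Note that a matcher like this would also flag "bank" in a sentence about a river bank, which is the polysemy problem raised earlier; keeping the vocabulary tied to one specific target application is what keeps such false matches tolerable.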